High-level remarks

Neural Information Processing Systems

We thank the reviewers for their detailed and thoughtful comments. These points are not new and have been presented thoroughly in the submitted paper. Our intention was not to challenge the momentum mechanism. Combining SwAV with a momentum encoder and/or a large memory bank is indeed an interesting follow-up. In Tab. 5, we make a best-effort fair comparison (same data augmentation, num.



Deep Ensembles Work, But Are They Necessary?

Neural Information Processing Systems

Ensembling neural networks is an effective way to increase accuracy, and can often match the performance of individual larger models. This observation poses a natural question: given the choice between a deep ensemble and a single neural network with similar accuracy, is one preferable over the other? Recent work suggests that deep ensembles may offer distinct benefits beyond predictive power: namely, uncertainty quantification and robustness to dataset shift. In this work, we demonstrate limitations to these purported benefits, and show that a single (but larger) neural network can replicate these qualities. First, we show that ensemble diversity, by any metric, does not meaningfully contribute to an ensemble's ability to detect out-of-distribution (OOD) data, but is instead highly correlated with the relative improvement of a single larger model. Second, we show that the OOD performance afforded by ensembles is strongly determined by their in-distribution (InD) performance, and - in this sense - is not indicative of any effective robustness. While deep ensembles are a practical way to achieve improvements to predictive power, uncertainty quantification, and robustness, our results show that these improvements can be replicated by a (larger) single model.
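The comparison above hinges on two quantities: an ensemble's confidence-based OOD score and its member diversity. A minimal sketch of both, using hypothetical random logits in place of trained models (the max-softmax-probability score and pairwise-disagreement diversity metric are common choices, not necessarily the exact metrics used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits: 3 ensemble members, 5 inputs, 4 classes.
ensemble_logits = rng.normal(size=(3, 5, 4))

# Deep ensemble: average member probabilities, then score confidence.
probs = softmax(ensemble_logits)          # (members, inputs, classes)
ens_probs = probs.mean(axis=0)            # ensemble predictive distribution
msp_score = ens_probs.max(axis=-1)        # max-softmax-probability OOD score

# One diversity metric: mean pairwise disagreement of member predictions.
preds = probs.argmax(axis=-1)             # (members, inputs)
disagree = (preds[:, None, :] != preds[None, :, :]).mean(axis=(0, 1))
```

A single larger model produces the same kind of `msp_score` from its own softmax output, which is the comparison the abstract describes; `disagree` is the sort of diversity statistic the paper argues does not improve OOD detection.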


Efficiently Computing Local Lipschitz Constants of Neural Networks via Bound Propagation

Neural Information Processing Systems

Lipschitz constants are connected to many properties of neural networks, such as robustness, fairness, and generalization. Existing methods for computing Lipschitz constants either produce relatively loose upper bounds or are limited to small networks. In this paper, we develop an efficient framework for computing the $\ell_\infty$ local Lipschitz constant of a neural network by tightly upper bounding the norm of the Clarke Jacobian via linear bound propagation. We formulate the computation of local Lipschitz constants as a linear bound propagation process on a high-order backward graph induced by the chain rule of the Clarke Jacobian. To enable linear bound propagation, we derive tight linear relaxations for specific nonlinearities in the Clarke Jacobian. This formulation unifies existing ad-hoc approaches such as RecurJac, which can be seen as a special case of ours with weaker relaxations. The bound propagation framework also allows us to easily borrow the popular Branch-and-Bound (BaB) approach from neural network verification to further tighten Lipschitz constants. Experiments show that on tiny models, our method produces bounds comparable to exact methods that cannot scale to slightly larger models; on larger models, our method efficiently produces tighter results than existing relaxed or naive methods, and it scales to much larger practical models that previous works could not handle. We also demonstrate an application to provable monotonicity analysis.
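For context, the "naive" baseline such methods improve on is the global bound obtained by multiplying per-layer operator norms (valid because ReLU is 1-Lipschitz). A sketch with hypothetical random weights, checking empirically that finite-difference slopes never exceed the bound:

```python
import numpy as np

rng = np.random.default_rng(1)

def linf_operator_norm(W):
    # Induced l_inf -> l_inf operator norm: maximum absolute row sum.
    return np.abs(W).sum(axis=1).max()

# Hypothetical weights of a 2-layer ReLU network f(x) = W2 @ relu(W1 @ x).
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))

# Naive global bound: product of layer norms. Local bounds as in the
# paper can be much tighter, since they exploit which ReLUs are active.
naive_bound = linf_operator_norm(W2) * linf_operator_norm(W1)

def f(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Empirical finite-difference slopes in the l_inf metric.
slopes = []
for _ in range(200):
    x, y = rng.normal(size=4), rng.normal(size=4)
    slopes.append(np.abs(f(x) - f(y)).max() / np.abs(x - y).max())
```

The gap between `max(slopes)` and `naive_bound` illustrates the looseness that tighter Jacobian-norm bounds aim to close.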


Detecting Token-Level Hallucinations Using Variance Signals: A Reference-Free Approach

Kumar, Keshav

arXiv.org Artificial Intelligence

Large Language Models (LLMs) demonstrate impressive generative abilities across a wide range of tasks but continue to suffer from hallucinations: outputs that are fluent yet factually incorrect. This paper introduces a reference-free, token-level hallucination detection framework that identifies unreliable tokens by analyzing variance in log-probabilities across multiple stochastic generations. Unlike traditional methods that depend on external references or sentence-level verification, our approach is model-agnostic, interpretable, and computationally efficient, making it suitable for both real-time and post-hoc analysis. We evaluate the proposed method on three diverse datasets, SQuAD v2 (unanswerable questions), XSum (abstractive summarization), and TriviaQA (open-domain question answering), using autoregressive models of increasing scale: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Results show that token-level variance strongly correlates with hallucination behavior, revealing clear distinctions in uncertainty across model sizes. The framework maintains accuracy even under limited sampling conditions and introduces minimal computational overhead, supporting its practicality for lightweight deployment. Overall, this work provides a scalable, reproducible, and fine-grained diagnostic tool for detecting hallucinations in LLMs, with potential extensions to multilingual and real-time generation settings. Large language models (LLMs) have transformed natural language processing, powering tasks such as summarization, dialogue generation, and open-ended question answering.
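The core signal described above can be sketched in a few lines: given per-token log-probabilities collected across several stochastic generations (here simulated with random numbers, and assuming token positions are aligned across samples; the paper's actual alignment and thresholding may differ), compute the per-position variance and flag high-variance tokens:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical log-probabilities for 6 token positions across 5 stochastic
# generations. Position 3 gets extra noise to mimic an unreliable token.
logprobs = rng.normal(loc=-1.0, scale=0.2, size=(5, 6))
logprobs[:, 3] += rng.normal(scale=1.5, size=5)

# Variance signal per token position: higher variance suggests the model
# is inconsistent about that token across samples.
token_variance = logprobs.var(axis=0)

# Simple mean + 2*std threshold (an illustrative choice, not the
# paper's calibration procedure).
threshold = token_variance.mean() + 2 * token_variance.std()
flags = token_variance > threshold
```

In a real deployment, `logprobs` would come from re-scoring or re-sampling the model's output with sampling enabled, which is why the method needs no external reference.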